Abstract
We introduce $\pi^3$, a feed-forward neural network that offers a novel approach to visual geometry reconstruction, breaking the reliance on a conventional fixed reference view. Previous methods often anchor their reconstructions to a designated viewpoint, an inductive bias that can lead to instability and failures if the reference is suboptimal. In contrast, $\pi^3$ employs a fully permutation-equivariant architecture to predict affine-invariant camera poses and scale-invariant local point maps without any reference frames. This design makes our model inherently robust to input ordering and highly scalable. These advantages enable our simple and bias-free approach to achieve state-of-the-art performance on a wide range of tasks, including camera pose estimation, monocular/video depth estimation, and dense point map reconstruction. Code and models are publicly available.